This analysis explores the effects of different chemicals in white wine on the taste and reported quality of individual wines.
A study from the University of Minho detailed the chemical profiles and subjective rating of several different wines. The full profile of the data, collection methods and descriptions of objective and subjective attributes can be found here: link
What makes a good wine, and whether a wine can be considered objectively superior to others, is the subject of intense debate. Some studies suggest that the perception of a wine’s quality is based more on the container or price than the beverage itself — for a summary of some statistical analyses of wine tasting competitions, check out this article link.
But many of these analyses do not take the chemical profile of the wines into account. In this analysis, I will be exploring whether the chemically measured aspects of the individual wine samples had any meaningful effect on the wine’s rating.
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
This dataset contains 4898 observations - so almost 5000 individual wine samples - of 13 variables. Eleven of those are the chemical profiles of the wines, one is the score the wine has been given, and X is the sample number. This dataset seems very tidy, so no further cleaning is needed.
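The structure checks above can be sketched in base R. This is only an illustration: the data frame below holds the first five rows as printed in the `str()` output, and the name `wine_head` and the CSV file name in the comment are assumptions, not taken from the original analysis.

```r
# First five rows, copied from the str() output above. With the full file one
# would presumably start from something like
#   read.csv("wineQualityWhites.csv")   # file name is an assumption
wine_head <- data.frame(
  X                    = 1:5,
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

dim(wine_head)    # rows, then columns
names(wine_head)  # column names
str(wine_head)    # type and first values of each column

# A quick tidiness check: no missing values, every column numeric.
n_missing   <- sum(is.na(wine_head))
all_numeric <- all(vapply(wine_head, is.numeric, logical(1)))
```

Run on the full data, the same three calls would give the 4898 x 13 shape and column listing shown above.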
While “quality” could have easily been an ordered factor, it was recorded as an int. For now, however, this representation serves our purpose just fine, so let’s get a summary.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Wine samples were scored on a scale from 0 to 10. Reported scores in the dataset ranged from 3 to 9, with a median of 6 and a mean of 5.878 (wines were given whole-number scores). Time for a histogram. The histogram is roughly bell-shaped and makes it clear that the vast majority of wine samples are rated at or near 6, slightly above the midpoint of the scale, with only around 400 samples in total scoring either below 5 or above 7.
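The score distribution can be checked numerically with a short sketch. The per-score counts below are not printed anywhere in this write-up; they are the counts usually quoted for the public white-wine data and are included here as an assumption, cross-checked against the figures that are printed (4898 rows, mean 5.878, five 9s).

```r
# Assumed counts per score (not from this write-up); checked below against
# the totals and mean that the summary() output does report.
score_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                  `7` = 880, `8` = 175, `9` = 5)
quality <- rep(as.integer(names(score_counts)), times = score_counts)

length(quality)           # total number of samples
median(quality)
round(mean(quality), 3)

# Samples scoring below 5 or above 7.
n_tails <- sum(quality < 5 | quality > 7)

# The same scores as an ordered factor on the full 0-10 scale.
quality_f <- factor(quality, levels = 0:10, ordered = TRUE)
table(quality_f)

# Histogram counts with one bin per whole-number score (no plot is drawn).
h <- hist(quality, breaks = seq(2.5, 9.5, by = 1), plot = FALSE)
h$counts
```

Centering the breaks on the half-integers keeps each whole-number score in its own bin, which avoids the gaps a default bin choice can produce with integer data.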
Only 5 wines scored a 9. I wonder how their chemical compositions compare to the entire dataset’s.
While overlaying the averages of the top scoring wines on the histograms of the entire samplings’ data by variable would be enlightening, it might also be a little overkill for the purposes of this exploration, so let’s just compare summary ranges and means for now.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 775.0 Min. :6.60 Min. :0.240 Min. :0.290
## 1st Qu.: 821.0 1st Qu.:6.90 1st Qu.:0.260 1st Qu.:0.340
## Median : 828.0 Median :7.10 Median :0.270 Median :0.360
## Mean : 981.4 Mean :7.42 Mean :0.298 Mean :0.386
## 3rd Qu.: 877.0 3rd Qu.:7.40 3rd Qu.:0.360 3rd Qu.:0.450
## Max. :1606.0 Max. :9.10 Max. :0.360 Max. :0.490
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 1.60 Min. :0.0180 Min. :24.0 Min. : 85
## 1st Qu.: 2.00 1st Qu.:0.0210 1st Qu.:27.0 1st Qu.:113
## Median : 2.20 Median :0.0310 Median :28.0 Median :119
## Mean : 4.12 Mean :0.0274 Mean :33.4 Mean :116
## 3rd Qu.: 4.20 3rd Qu.:0.0320 3rd Qu.:31.0 3rd Qu.:124
## Max. :10.60 Max. :0.0350 Max. :57.0 Max. :139
## density pH sulphates alcohol
## Min. :0.9897 Min. :3.200 Min. :0.360 Min. :10.40
## 1st Qu.:0.9898 1st Qu.:3.280 1st Qu.:0.420 1st Qu.:12.40
## Median :0.9903 Median :3.280 Median :0.460 Median :12.50
## Mean :0.9915 Mean :3.308 Mean :0.466 Mean :12.18
## 3rd Qu.:0.9906 3rd Qu.:3.370 3rd Qu.:0.480 3rd Qu.:12.70
## Max. :0.9970 Max. :3.410 Max. :0.610 Max. :12.90
## quality
## Min. :9
## 1st Qu.:9
## Median :9
## Mean :9
## 3rd Qu.:9
## Max. :9
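A comparison like the one above could be produced by subsetting on the top score and summarising both groups. The sketch below uses a made-up toy data frame (`toy`, with invented values) purely to show the calls; it does not reproduce the numbers in the output above.

```r
# Made-up toy rows for illustration only; these are NOT values from the dataset.
toy <- data.frame(
  alcohol   = c(9.0, 10.0, 11.0, 12.0, 13.0, 12.5),
  chlorides = c(0.060, 0.050, 0.045, 0.030, 0.025, 0.020),
  quality   = c(5L, 6L, 6L, 9L, 9L, 7L)
)

# Keep only the rows with the highest score present in the data.
top <- subset(toy, quality == max(quality))

summary(top)   # ranges, quartiles and means for the top scorers

# Side-by-side means for the whole set and for the top scorers.
chem_cols  <- c("alcohol", "chlorides")
comparison <- rbind(
  all = colMeans(toy[, chem_cols]),
  top = colMeans(top[, chem_cols])
)
comparison
```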
Fixed acidity is described in the accompanying file as tartaric acid (“most acids involved with wine or fixed or nonvolatile”) and is measured in g / dm^3.
The lowest value for fixed acidity was 3.8, the highest was 14.2, and the mean was 6.855.
The wines with 9s had a mean of 7.42, with quartiles at 6.9 and 7.4. These numbers are all slightly higher than their full-dataset counterparts, which is potentially significant but may just be an artifact of the five-wine sample. Possible candidate for further investigation.
Volatile acidity is the amount of acetic acid in each wine sample, in g / dm^3. Too much acetic acid can cause wine to taste like vinegar. The minimum level for all samples was 0.08, the maximum level was 1.1, and the mean was 0.28.
The top scorers here had a mean of 0.30, with a minimum and maximum of 0.24 and 0.36 respectively. In this regard, these wines seem average.
Citric acid is described as adding “freshness” to a wine’s flavor. The lowest amount found in the whole dataset was 0 g / dm^3 and the highest was 1.66, but the median was 0.32 (mean 0.33), with first and third quartiles clustered nearby at 0.27 and 0.39, respectively.
The minimum reported citric acid value for the top 5 was 0.29, and the maximum was 0.49. The mean was 0.39 and the quartiles were 0.34 and 0.45 - potentially skewed towards slightly higher concentrations than the total population? It might be interesting to see if exceptionally low values correspond to lower quality ratings.
While this dataset included an apparent (superior) sweet wine or two, with a max reported value of 65.8 grams/liter (wines over 45 grams/liter are considered sweet), the mean residual sugar value was 6.391 g/l, with quartiles at 1.7 and 9.9. The minimum value was 0.6.
The gold standard wines, in contrast, had a mean of 4.12 g/l, with quartiles at 2.0 and 4.2. The lowest was 1.6, and the highest seemed to be a bit of an outlier at 10.6. It is interesting to note that the top rated wines mostly fell between the first quartile and mean of the total dataset.
Chlorides measure the amount of salt in the wine, in g / dm^3. The total dataset had a minimum of 0.009, a maximum of 0.346, and a mean of 0.04577.
The 5 best wines were all below the total average, with a mean of 0.0274. This seems like a potential candidate for a predictor of wine quality.
Free sulfur dioxide is important in preventing microbial growth and oxidation. Its values in the dataset ranged from 2 to 289. The mean was 35.31 (values were reported in whole numbers here as well), and the quartiles were nearby at 23 and 46.
For the top tier wines, the mean was 33.4, with quartiles at 27 and 31, the minimum at 24, and the max potentially outlying at 57. While free sulfur dioxide might be passively important to ensuring wine quality, by preventing negative aspects of aging such as oxidation, these values seem reasonably average.
Total sulfur dioxide is the total amount of both bound and free forms of SO2 in the sample, measured in mg / dm^3. Values ranged from 9 to 440, with a mean of 138.4 and quartiles at 108 and 167.
The SO2 average for the best wines was, in contrast, 116, with a minimum of 85 and a maximum of 139. A lower, but not too low, total SO2 count seems potentially important for taste.
The density of most wine is close to water, but is affected by alcohol and sugar content. The average density of all wines was 0.994 g / cm^3, with quartiles at 0.9917 and 0.9961.
The quartiles for supreme sample density were at 0.9898 and 0.9906, and the mean was 0.9915 - skewed above the quartiles, most likely by the maximum value at 0.997. Something wonky may be going on here, but as density is affected by alcohol and sugar content, it might make more sense to check those variables’ effect on quality score first.
Most wines have a pH between 3 and 4, and all wines tested here fell between a pH of 2.72 and 3.82. The mean for all wines was 3.188.
The mean for the top scorers was 3.308, and all five samples fell above the total mean, ranging between 3.2 and 3.41. This is another case where the difference could be significant, or simply an effect of the small sample size.
Potassium sulphate is often added to wine to bolster SO2 levels, again to prevent microbe growth and wine oxidation. All wine samples fell between 0.22 g / dm^3 and 1.08 g / dm^3, with quartiles at 0.41 and 0.55. The mean sulphate content of all samples was 0.49.
The top shelf samples fell between 0.36 and 0.61, with quartiles at 0.42 and 0.48 and a mean of 0.47. This seems sufficiently average, and makes me wonder about the process of SO2 management in wine production - perhaps an industry standard amount determines the levels added, rather than a chemical assay on how much a young wine needs to prevent spoiling? If so, this could account for the widely fluctuating levels of total SO2 in the wines, as compared to the variation in other chemical ranges.
Alcohol is arguably the chemical factor with the most agency to improve a wine’s score, depending on the number of other samples an expert has judged in one go. The range for all samples was between 8% and 14.2% by volume. The first and third quartiles were at 9.5% and 11.4%, and the mean alcohol content was 10.51% for all samples.
The top scorers all fell between 10.4% and 12.9%, and the mean was 12.18%, too much of a difference not to investigate given the physiological and psychological effects of alcohol on humans.
Using only the data from the 5 best wines to select which chemical aspects to investigate has some obvious drawbacks, especially the constant potential for artificially high or low means when compared to the total dataset means. Still, I felt it was a good, quick way to identify which factors had the biggest potential to predict whether a wine would score exceptionally well, since I am most interested in finding out which chemical aspects make a wine excel and which are not correlated with better scores.
Another suspicion I have about this data is that some chemicals are likely more strongly related to a wine not being bad. For example, too much citric acid might make a wine taste funny, while below some threshold the amount falls to personal preference or is unreliably detected by the human tongue. This would be another very interesting investigation, and I suspect it might be far easier to correlate certain levels of chemicals with drinkable versus undrinkable than with excellence. That is not, however, the analysis I will do here.
The variables I am most interested in, after comparing quartiles and means, are chlorides, total sulfur dioxide, alcohol, and pH. Let’s look a little closer with some histograms.
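As a rough way to rank the candidates, the means and quartiles printed in the two `summary()` outputs can be compared directly. The sketch below only re-uses those printed figures; the data frame name `stats` and its column names are hypothetical, and with five top-scoring wines the gaps are indicative at best.

```r
# Figures copied from the summary() outputs above (full data vs. the five 9s).
stats <- data.frame(
  all_q1   = c(6.3, 0.21, 0.27, 1.7, 0.036, 23, 108, 0.9917, 3.09, 0.41, 9.5),
  all_mean = c(6.855, 0.2782, 0.3342, 6.391, 0.04577, 35.31, 138.4,
               0.9940, 3.188, 0.4898, 10.51),
  all_q3   = c(7.3, 0.32, 0.39, 9.9, 0.050, 46, 167, 0.9961, 3.28, 0.55, 11.4),
  top_mean = c(7.42, 0.298, 0.386, 4.12, 0.0274, 33.4, 116,
               0.9915, 3.308, 0.466, 12.18),
  row.names = c("fixed.acidity", "volatile.acidity", "citric.acid",
                "residual.sugar", "chlorides", "free.sulfur.dioxide",
                "total.sulfur.dioxide", "density", "pH", "sulphates",
                "alcohol")
)

# Relative gap between the top-five mean and the overall mean, in percent.
stats$pct_diff <- round(100 * (stats$top_mean - stats$all_mean) /
                          stats$all_mean, 1)

# Does the top-five mean sit inside the overall interquartile range?
stats$inside_iqr <- stats$top_mean >= stats$all_q1 &
  stats$top_mean <= stats$all_q3

# Largest relative gaps first.
stats[order(-abs(stats$pct_diff)), c("all_mean", "top_mean",
                                     "pct_diff", "inside_iqr")]
```

On these figures chlorides show the largest relative gap (about 40% below the overall mean), and the top-five means for fixed acidity, chlorides, density, pH and alcohol fall outside the overall interquartile range.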
For chlorides, with a binwidth of 0.001, we see a tall, skinny but (mostly) normally distributed majority, with a bit of a tail starting around 0.08. I’ll be interested to see the distribution of sample scores above and below that number.
Total sulfur dioxide is fairly normally distributed with a binwidth of 5, with just a handful of outliers on the high end.
The pH plot, which at first suggested a bit of modality, comes out as fairly regular around the mean (3.19) when binwidth is adjusted to 0.01.
Alcohol follows a low, slow right skew, with many more wines of around 9.5% alcohol. The most interesting breakdown of this to me so far would be to check the quality rankings between 9 and 9.5, between 9.5 and 11.5, and 12 to 12.5.
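The effect of the bin width on a histogram can be sketched in base R. The helper `bin_counts` below is hypothetical (it is not part of the original analysis), and the ten pH readings are just the first ten values from the `str()` output, so the counts only illustrate the mechanics.

```r
# First ten pH readings, copied from the str() output earlier in the write-up.
ph <- c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22)

# Hypothetical helper: count observations in bins of a fixed width.
# One spare bin is added at each end so rounding error cannot drop a value.
bin_counts <- function(x, binwidth) {
  idx    <- seq(floor(min(x) / binwidth) - 1, ceiling(max(x) / binwidth) + 1)
  breaks <- idx * binwidth
  hist(x, breaks = breaks, plot = FALSE)
}

coarse <- bin_counts(ph, binwidth = 0.1)   # few wide bins
fine   <- bin_counts(ph, binwidth = 0.01)  # many narrow bins

length(coarse$counts)
length(fine$counts)
sum(fine$counts)   # every observation lands in exactly one bin
```

A wide bin can make a distribution look lumpy or multimodal simply because of where the bin edges fall, which is why it is worth trying more than one width before reading shape into a plot.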
As a quick check, I want to do a correlation matrix.
It seems the strongest correlations in the dataset involve alcohol, density and sugar, as well as total sulfur dioxide. The highest correlation with quality is alcohol, at 0.4. Close behind are chlorides, total sulfur dioxide and density; however, as a liquid’s density is affected by alcohol content, density is also strongly related to alcohol.
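A correlation matrix of this kind comes from a single `cor()` call. The sketch below runs it on the first ten rows printed in the `str()` output, for four of the columns; with so few rows the coefficients will not match the full-data values quoted above, and `quality` is left out because it is constant (all 6) in those ten rows.

```r
# First ten rows of four columns, copied from the str() output.
wine_head <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

cm <- cor(wine_head)   # Pearson correlation by default
round(cm, 2)

# Locate the strongest off-diagonal pair.
off <- cm
diag(off) <- 0
idx <- which(abs(off) == max(abs(off)), arr.ind = TRUE)[1, ]
strongest_pair <- c(rownames(cm)[idx["row"]], colnames(cm)[idx["col"]])
strongest_pair
```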
I chose a violin plot for this variable after noticing a huge number of outliers and similar medians in a box plot. For the entire set of chloride levels, we see that the maximum density of observations lowers as quality increases, but that on the whole, the range where most observations exist stays fairly consistent. From the earlier histogram, I remember that on either side of 0.08, I wondered if I would see a trend in quality scores: let’s break those down now.
Two histograms wrapped by new variable chloride.threshold, created using the cut() function, show that neither group clusters around either higher or lower scores; both can be said to focus around the median of quality scores.
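The grouping variable can be built with `cut()` roughly as follows. The exact breaks and labels used in the original are not shown, so the 0.08 split and the label text here are assumptions; the five values are the minimum, quartiles and maximum from the full-data `summary()` output.

```r
# Chloride values from the full-data summary(): min, Q1, median, Q3, max.
chlorides <- c(0.009, 0.036, 0.043, 0.050, 0.346)

# Two groups split at 0.08; intervals are closed on the right by default,
# so a value of exactly 0.08 would fall in the lower group.
chloride.threshold <- cut(
  chlorides,
  breaks = c(0, 0.08, Inf),
  labels = c("at or below 0.08", "above 0.08")
)

table(chloride.threshold)
```

The resulting factor can then be used as the faceting or grouping variable for the per-group histograms of quality.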
This plot seems to oscillate towards a narrower range as quality goes up, with lower scores having the largest quantiles and error bars, and the error getting smaller and smaller the higher the quality. The trend is especially strong between the three biggest groups of samples (scores 5, 6, and 7), and the medians and error bars do seem to narrow towards a point. This may be worth investigating in a linear model later on.
For density, I tested overlaying a violin plot with a box plot, to see how the median changed as well as the distribution of observations around those medians. There does seem to be a trend towards higher qualities being slightly less dense, but the density range of all the over-5 quality samples is somewhat wide. As density is affected by the alcohol and sugar content, it is unclear whether density is the causal variable or simply a side effect of preferred sugar and alcohol levels.
This is an interesting plot. The mean alcohol content between score 5 and 9 moves strongly upwards. Score 3 and 4 move downwards towards the local minimum at 5 - however, it’s worth noting that score 5 is the only category with more than two outliers, so the calculated mean may be artificially low. This might be worth a closer look.
This plot shows a distinct trend in the means. Despite the density and range of lower quality score alcohol levels, higher quality scores appear to more often have higher alcohol levels (around 12%) and lower quality scores tend to have lower alcohol levels (around 9.8%).
What I notice across all the graphs is that the means of all chemicals but alcohol are within quantiles of each other, with error bars and outliers concentrating around the quality categories with the highest counts, as one would expect.
So, let’s check out whether certain chemical balances score higher.
This graph shows a strong inverse linear relationship between density and alcohol, which makes sense, but along that trend, we see that higher quality categories tend towards lower density and higher alcohol percentages.
It appears that wines with higher quality scores may have somewhat less sulfur dioxide, although again, whether wines score better due to lower total SO2 or whether wines with more alcohol simply tend to have less SO2 is yet to be seen.
## [1] "high-quality (5+) model"
##
## Calls:
## hq1: lm(formula = alcohol ~ quality, data = high.quality)
## hq2: lm(formula = alcohol ~ quality + density, data = high.quality)
## hq3: lm(formula = alcohol ~ quality + density + residual.sugar, data = high.quality)
##
## =======================================================
## hq1 hq2 hq3
## -------------------------------------------------------
## (Intercept) 6.265*** 296.615*** 531.921***
## (0.118) (3.714) (5.895)
## quality 0.716*** 0.352*** 0.192***
## (0.020) (0.014) (0.012)
## density -289.920*** -526.695***
## (3.708) (5.919)
## residual.sugar 0.156***
## (0.003)
## -------------------------------------------------------
## R-squared 0.2 0.7 0.8
## adj. R-squared 0.2 0.7 0.8
## sigma 1.1 0.7 0.6
## F 1318.6 4571.9 5190.4
## p 0.0 0.0 0.0
## Log-likelihood -7107.3 -5146.1 -4247.3
## Deviance 5627.4 2449.2 1672.8
## AIC 14220.7 10300.3 8504.7
## BIC 14240.1 10326.1 8537.0
## N 4715 4715 4715
## =======================================================
##
## Call:
## lm(formula = alcohol ~ quality + density + residual.sugar, data = high.quality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8541 -0.3864 -0.0611 0.3571 15.6160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.319e+02 5.895e+00 90.24 <2e-16 ***
## quality 1.923e-01 1.192e-02 16.14 <2e-16 ***
## density -5.267e+02 5.919e+00 -88.99 <2e-16 ***
## residual.sugar 1.555e-01 3.326e-03 46.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5959 on 4711 degrees of freedom
## Multiple R-squared: 0.7677, Adjusted R-squared: 0.7676
## F-statistic: 5190 on 3 and 4711 DF, p-value: < 2.2e-16
## [1] "low-quality(3&4) model"
##
## Calls:
## lq1: lm(formula = chlorides ~ quality, data = subset(wineQualityWhites,
## as.numeric(quality) <= 4))
## lq2: lm(formula = chlorides ~ quality + density, data = subset(wineQualityWhites,
## as.numeric(quality) <= 4))
## lq3: lm(formula = chlorides ~ quality + density + alcohol, data = subset(wineQualityWhites,
## as.numeric(quality) <= 4))
##
## ================================================
## lq1 lq2 lq3
## ------------------------------------------------
## (Intercept) 0.067* -3.959*** -1.885
## (0.027) (0.799) (1.106)
## quality -0.004 -0.002 -0.004
## (0.007) (0.006) (0.006)
## density 4.040*** 2.036
## (0.801) (1.089)
## alcohol -0.007**
## (0.003)
## ------------------------------------------------
## R-squared 0.0 0.1 0.2
## adj. R-squared -0.0 0.1 0.1
## sigma 0.0 0.0 0.0
## F 0.4 12.9 11.3
## p 0.5 0.0 0.0
## Log-likelihood 390.8 402.9 406.5
## Deviance 0.1 0.1 0.1
## AIC -775.7 -797.8 -803.0
## BIC -766.0 -785.0 -786.9
## N 183 183 183
## ================================================
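To read the high-quality model table concretely, the printed coefficients can be plugged back into the model formulas. This is a sketch using the rounded coefficients from the table, so results are approximate; in particular, the density coefficient is large enough that rounding density to three decimals shifts the fitted value noticeably.

```r
# Coefficients copied from the hq1 and hq3 columns of the table above.
hq1 <- c(intercept = 6.265, quality = 0.716)
hq3 <- c(intercept = 531.921, quality = 0.192,
         density = -526.695, residual.sugar = 0.156)

# hq1: fitted alcohol (% by volume) at each score from 5 to 9.
scores     <- 5:9
fitted_hq1 <- hq1[["intercept"]] + hq1[["quality"]] * scores
setNames(round(fitted_hq1, 2), scores)

# hq3: fitted alcohol for the first sample in the str() output
# (quality 6, density 1.001, residual sugar 20.7; observed alcohol 8.8).
fitted_row1 <- hq3[["intercept"]] +
  hq3[["quality"]] * 6 +
  hq3[["density"]] * 1.001 +
  hq3[["residual.sugar"]] * 20.7
residual_row1 <- 8.8 - fitted_row1
round(c(fitted = fitted_row1, residual = residual_row1), 2)
```

Under hq1 the fitted alcohol level rises from roughly 9.8% at a score of 5 to roughly 12.7% at a score of 9, which is the same upward trend seen in the alcohol-by-quality plots.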
This plot highlights a trend present in most of the scientifically obtained variables - the wide range of outliers centered on the lower-middle end of the score. Chlorides especially exhibit an almost-flat line for quantiles, mean (red) and median (blue dashed) scores across quality levels.
This chart explored how a sample’s alcohol percentage related to its quality. With many more samples in the lower spectrums again, it was exciting to see a linear trend emerge. With fewer samples in the higher scores, we see the error increase along with the quality score - but our linear model supports that both of these can be used to predict a wine’s score.
This chart explores in more detail the bivariate trends at each quality level. While both Density and Total Sulfur Dioxide display more variety at score 5 and 6, a non-flat linear trend persists in the Density X Alcohol breakdown, whereas a non-flat linear trend does not emerge in higher quality levels for TSO2. Since Alcohol exhibited the strongest trend, you’d expect that if a particular level of TSO2 was an indicator of quality, you’d find that level paired with any one quality score more often than the others, and that is the opposite of what we see.
This dataset suggests, after scrutiny, that certain factors (alcohol, residual sugar, and density) may be more strongly correlated to a sample’s score than others (chlorides, sulfur dioxide, pH). With so few samples of exceptional wines, it’s hard to classify what makes a great wine as opposed to a mediocre or bad wine; however, my observation is that these three more-predictive variables could be classified as sort of “macro” flavors - flavors all humans can probably detect easily and accurately.
The dataset did not provide any ranges for humans’ ability to detect or distinguish between concentrations of any of the chemicals, let alone those that the study mentioned as “important” to a wine’s flavor. Smell plays a huge role in how we perceive taste, as well, and this dataset didn’t include any variables in that category. Since it’s highly possible that humans have perception thresholds for some of these chemicals that are much higher than the reported values, this dataset only represents a tiny fraction of the variables involved with tasting wine.
My initial reaction to this dataset was to worry that I would not uncover any strong relationships. As initial probes with histograms seemed to confirm this fear, I really struggled with how to select trends and relationships worth investigating, and debated for a time simply performing repetitive, standard graphs of all the variables to brute-force any possible relationships out.
As the analysis progressed and a few clear front-runners came out, I next grappled with the opposite issue - suddenly, only a few variables seemed worth looking at! All my graphs were, for a period, different permutations of Alcohol X Quality, trying to discover the best way to capture the relationship. Eventually I decided that I wanted to include some of the variables with uninteresting graphs - see those flat lines of total sulfur dioxide and messy chlorides - because I felt that they helped to justify making such a big deal out of so slight a relationship as that between alcohol and quality. It told the story I was seeing, that only slight trends were to be found, with more context than the alcohol X density X quality plots alone.
Another challenge was how to describe the alcohol X density X sugar relationship as it related to quality. I had in my mind that a three-dimensional histogram, colored by quality perhaps, or with a synthesized line showing some sort of quality-predictive plane, would be amazing. I tried plot3D, OceanView, and the generic R 3D modeling, but was unsatisfied with all of my experiments. That will be the next update I perform on this dataset. A larger dataset including the sample’s country of origin would be an exciting version of this - a multidimensional heat map linking the number of high-quality wines to the typical density profiles of the wines they produce.